Exploratory Data Analysis
Exploration
Summary Statistics
The following is a summary of the data.
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| TARGET_WINS | 0.00 | 71.00 | 82.00 | 80.79 | 92.00 | 146.00 | 0 |
| TEAM_BATTING_H | 891 | 1383 | 1454 | 1469 | 1537 | 2554 | 0 |
| TEAM_BATTING_2B | 69.0 | 208.0 | 238.0 | 241.2 | 273.0 | 458.0 | 0 |
| TEAM_BATTING_3B | 0.00 | 34.00 | 47.00 | 55.25 | 72.00 | 223.00 | 0 |
| TEAM_BATTING_HR | 0.00 | 42.00 | 102.00 | 99.61 | 147.00 | 264.00 | 0 |
| TEAM_BATTING_BB | 0.0 | 451.0 | 512.0 | 501.6 | 580.0 | 878.0 | 0 |
| TEAM_BATTING_SO | 0.0 | 548.0 | 750.0 | 735.6 | 930.0 | 1399.0 | 102 |
| TEAM_BASERUN_SB | 0.0 | 66.0 | 101.0 | 124.8 | 156.0 | 697.0 | 131 |
| TEAM_BASERUN_CS | 0.0 | 38.0 | 49.0 | 52.8 | 62.0 | 201.0 | 772 |
| TEAM_BATTING_HBP | 29.00 | 50.50 | 58.00 | 59.36 | 67.00 | 95.00 | 2085 |
| TEAM_PITCHING_H | 1137 | 1419 | 1518 | 1779 | 1682 | 30132 | 0 |
| TEAM_PITCHING_HR | 0.0 | 50.0 | 107.0 | 105.7 | 150.0 | 343.0 | 0 |
| TEAM_PITCHING_BB | 0.0 | 476.0 | 536.5 | 553.0 | 611.0 | 3645.0 | 0 |
| TEAM_PITCHING_SO | 0.0 | 615.0 | 813.5 | 817.7 | 968.0 | 19278.0 | 102 |
| TEAM_FIELDING_E | 65.0 | 127.0 | 159.0 | 246.5 | 249.2 | 1898.0 | 0 |
| TEAM_FIELDING_DP | 52.0 | 131.0 | 149.0 | 146.4 | 164.0 | 228.0 | 286 |
Plots
The following density plots show the distribution of each variable. The red vertical line is the mean and the blue vertical line is the median. The scatter plot shows the relationship between wins and the variable.
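As a rough illustration of how one such pair of plots could be drawn, here is a minimal base R sketch. The helper name `plot_variable` is hypothetical and is not taken from the original analysis; it only follows the description above (density with a red mean line and a blue median line, plus a scatter plot against wins).

```r
# Hypothetical helper: density plot with mean (red) and median (blue) lines,
# next to a scatter plot of the variable against wins.
plot_variable <- function(x, wins, name) {
  keep <- !is.na(x) & !is.na(wins)   # drop rows with a missing value in either vector
  x <- x[keep]
  wins <- wins[keep]

  old <- par(mfrow = c(1, 2))        # two panels side by side
  on.exit(par(old))                  # restore the previous layout afterwards

  plot(density(x), main = paste("Density of", name))
  abline(v = mean(x), col = "red")     # mean
  abline(v = median(x), col = "blue")  # median

  plot(x, wins, xlab = name, ylab = "TARGET_WINS",
       main = paste("Wins vs", name))

  # Return the two reference values so they can be inspected.
  invisible(c(mean = mean(x), median = median(x)))
}
```

Assuming the data frame is called `training`, as in the model output, a call such as `plot_variable(training$TEAM_BATTING_H, training$TARGET_WINS, "TEAM_BATTING_H")` would draw both panels for one variable.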
Missing Data
Batting Strike Outs
To fill the 102 missing values, we alternate between the two modes of the distribution (578 and 909).
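A minimal sketch of such an alternating fill in base R. The helper name `fill_alternating` and the toy vector are hypothetical; only the two fill values (578 and 909) come from the text.

```r
# Hypothetical helper: replace NAs by cycling through the given fill values.
fill_alternating <- function(x, fills) {
  missing <- which(is.na(x))
  # rep_len() recycles the fill values: fills[1], fills[2], fills[1], ...
  x[missing] <- rep_len(fills, length(missing))
  x
}

# Toy vector standing in for TEAM_BATTING_SO (the values are made up).
so <- c(600, NA, 850, NA, NA, 920)
so_filled <- fill_alternating(so, fills = c(578, 909))
print(so_filled)
```

Cycling through the two values puts roughly half of the filled entries at each mode, rather than piling them all onto a single value.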
Pitching Strike Outs
We fill the 102 missing values with the mean.
Double Plays
We fill the 286 missing values with the mean.
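Mean imputation, as used for both pitching strikeouts and double plays, can be sketched in a few lines of base R. The helper name `fill_mean` and the toy values are hypothetical.

```r
# Hypothetical helper: replace NAs with the mean of the observed values.
fill_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy vector standing in for TEAM_FIELDING_DP (the values are made up).
dp <- c(140, NA, 150, 160, NA)
dp_filled <- fill_mean(dp)
print(dp_filled)
```

Filling with the mean leaves the column mean unchanged, although it does shrink the variance somewhat.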
Scaled and Combined
The idea behind this model is that teams that are better than average will win more games and teams that are worse than average will win fewer. We determine whether a team is better than average by looking at how well it performs at batting, pitching, and fielding.
Since there is more than one way to win a baseball game (e.g., power sluggers who hit home runs versus batters who reliably hit singles), we need to combine the various batting measures. Because striking out at bat is bad, we flip the sign of that variable. It can then be combined with the others and fits the premise that better teams win more and worse teams win fewer.
We scale all variables, which centers them at 0 and gives them a standard deviation of 1. We can then combine almost all of the batting variables into one measure (hit by pitch is excluded).
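The sign flip, scaling, and combining steps might look like the following base R sketch. The toy data are made up, only three batting columns are shown, and summing the scaled columns is an assumption: the text does not state the exact combining rule.

```r
# Toy data with made-up values; column names follow the summary table.
batting <- data.frame(
  TEAM_BATTING_H  = c(1400, 1500, 1450, 1600),
  TEAM_BATTING_HR = c(90, 120, 100, 150),
  TEAM_BATTING_SO = c(800, 700, 750, 600)
)

# Strikeouts hurt the batting side, so flip the sign before combining.
batting$TEAM_BATTING_SO <- -batting$TEAM_BATTING_SO

# scale() centers each column at 0 and rescales it to standard deviation 1.
scaled <- scale(batting)

# Assumption: combine by summing the scaled columns row by row.
TEAM_BATTING <- rowSums(scaled)
print(TEAM_BATTING)
```

With the sign flipped, a higher value means better batting in every column, so the fourth toy team (most hits, most home runs, fewest strikeouts) gets the highest combined score.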
Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING + TEAM_PITCHING + TEAM_FIELDING,
data = training)
Residuals:
Min 1Q Median 3Q Max
-42.578 -8.197 0.281 8.526 49.155
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 80.8822 0.2885 280.33 <2e-16 ***
TEAM_BATTING 3.6415 0.1897 19.20 <2e-16 ***
TEAM_PITCHING 3.0643 0.2174 14.10 <2e-16 ***
TEAM_FIELDING -0.1567 0.2062 -0.76 0.447
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.35 on 1986 degrees of freedom
(286 observations deleted due to missingness)
Multiple R-squared: 0.2169, Adjusted R-squared: 0.2157
F-statistic: 183.3 on 3 and 1986 DF, p-value: < 2.2e-16
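This model says that an average baseball team will win about 81 games. A team whose batting is one standard deviation above average is expected to win about 4 more games (coefficient 3.64), and one whose pitching is one standard deviation above average about 3 more (coefficient 3.06). The fielding coefficient (-0.16, p = 0.447) is not statistically significant, so fielding adds essentially no wins in this model.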
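To make the arithmetic concrete, the sketch below plugs the fitted coefficients into the linear equation. The coefficients are copied from the model output; the helper name `predict_wins` is hypothetical, and its inputs are scores on the combined, scaled predictors.

```r
# Coefficients copied from the lm() summary.
coefs <- c(intercept = 80.8822, batting = 3.6415,
           pitching = 3.0643, fielding = -0.1567)

# Hypothetical helper: predicted wins for given predictor scores
# (0 means exactly average on that measure).
predict_wins <- function(batting = 0, pitching = 0, fielding = 0) {
  unname(coefs["intercept"] +
           coefs["batting"]  * batting +
           coefs["pitching"] * pitching +
           coefs["fielding"] * fielding)
}

average_team  <- predict_wins()              # all scores at 0
good_batting  <- predict_wins(batting = 1)   # batting one unit above average
good_pitching <- predict_wins(pitching = 1)  # pitching one unit above average

print(c(average_team, good_batting, good_pitching))
```

The differences `good_batting - average_team` and `good_pitching - average_team` are just the batting and pitching coefficients, which is where the "about 4" and "about 3" extra wins come from.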